A Large-Scale Linear Regression Sentiment Model

Author

  • David Sun
Abstract

This report details the findings from building a large-scale linear regression sentiment model for an Amazon book review corpus. We studied and applied a number of regression and NLP techniques, including unigram and bigram features, stop-word removal, and ridge and lasso regression.

1. DATA CORPUS

The corpus contains 975194 non-distinct book reviews collected from Amazon by Mark Dredze and others at Johns Hopkins. Alongside the textual reviews we have numerically scored sentiment data on a scale of 1-5. The goal is to learn a linear-regression-based sentiment model from the textual reviews that can be applied to future reviews to estimate the numerical sentiment score.

2. DATA PROCESSING

The original data was represented in XML format. A simple tokenizer had been applied to obtain a flattened representation consisting of a reverse-indexed dictionary matrix, which allowed review text and sentiment scores to be extracted. We obtained document boundaries by segmenting at token positions corresponding to the pair. The textual body of a review is extracted by matching token positions for the pair. Review titles often contain sentiment-charged summaries of the review and hence are extracted and concatenated with the review body. Finally, numerical ratings are obtained by matching token positions between the pair.

2.1 Duplicate removal

It turns out that many of the reviews in the corpus were duplicates. To remove duplicates, we computed a textual hash of the first 20 words (or of all the words in the review if it has fewer than 20) and discarded reviews whose hash matched that of a prior review. At the end of this process we obtained 494761 distinct reviews, roughly half the size of the original corpus.

2.2 Featurization

We experimented with both bag-of-words and bag-of-bigrams models. Each review is featurized into a multinomial count of words and bigrams. These counts are then normalized via tf-idf scores, computed as: $S_{t,d} = \mathrm{tf}_{t,d} \times \log_2 N$
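The duplicate-removal step of Section 2.1 could be sketched roughly as follows. This is only an illustration under stated assumptions, not the author's code: the report does not say how words are delimited or which hash function was used, so whitespace splitting, MD5, and the helper names `prefix_hash` and `remove_duplicates` are all hypothetical choices.

```python
import hashlib


def prefix_hash(text, n_words=20):
    """Hash the first n_words words of a review.

    Assumption: words are whitespace-separated and MD5 is the hash;
    the report specifies neither.
    """
    words = text.split()[:n_words]  # shorter reviews contribute all their words
    return hashlib.md5(" ".join(words).encode("utf-8")).hexdigest()


def remove_duplicates(reviews, n_words=20):
    """Keep the first review seen for each prefix hash, drop later matches.

    `reviews` is an iterable of (text, rating) pairs (hypothetical layout).
    """
    seen = set()
    distinct = []
    for text, rating in reviews:
        key = prefix_hash(text, n_words)
        if key in seen:
            continue  # hash matches a prior review: treat as duplicate
        seen.add(key)
        distinct.append((text, rating))
    return distinct
```

Because only the first 20 words are hashed, two reviews that share an opening but differ later are treated as duplicates; that is a property of the scheme described in the report, and it keeps the comparison cheap on a corpus of this size.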


Similar resources

CS 294-1: Assignment 2 A Large-Scale Linear Regression Sentiment Model

The primary objective of this assignment was to build a linear regression sentiment model based on amazon.com reviews. The main challenge consisted of handling moderately large amounts of data on a single machine. The different variations that I tried include the following: exact solution (L2 loss and ridge regularization), stochastic gradient with different training schemes and initialization,...


A New Compromise Decision-making Model based on TOPSIS and VIKOR for Solving Multi-objective Large-scale Programming Problems with a Block Angular Structure under Uncertainty

This paper proposes a compromise model, based on a new method, to solve the multi-objective large-scale linear programming (MOLSLP) problems with block angular structure involving fuzzy parameters. The problem involves fuzzy parameters in the objective functions and constraints. In this compromise programming method, two concepts are considered simultaneously. First of them is that the optimal ...


INESC-ID: A Regression Model for Large Scale Twitter Sentiment Lexicon Induction

We present the approach followed by INESC-ID in the SemEval 2015 Twitter Sentiment Analysis challenge, subtask E. The goal was to determine the strength of the association of Twitter terms with positive sentiment. Using two labeled lexicons, we trained a regression model to predict the sentiment polarity and intensity of words and phrases. Terms were represented as word embeddings induced in an ...


A Compromise Decision-making Model for Multi-objective Large-scale Programming Problems with a Block Angular Structure under Uncertainty

This paper proposes a compromise model, based on the technique for order preference through similarity ideal solution (TOPSIS) methodology, to solve the multi-objective large-scale linear programming (MOLSLP) problems with block angular structure involving fuzzy parameters. The problem involves fuzzy parameters in the objective functions and constraints. This compromise programming method is ba...


Sentiment Analysis using Linear Regression

In this assignment we learn a linear model for determining the rating of textual book reviews from amazon.com using linear regression. Despite its simplicity, the linear model still performs fairly well. However, a clever choice of the various “ingredients” of the model, such as feature selection and the regularization term, could further improve its accuracy. In our work we study Unigram feat...



Publication date: 2012